Abstract

An asymptote is derived from Turing's local reestimation formula for population
frequencies, and a local reestimation formula is derived from Zipf's law for the
asymptotic behavior of population frequencies. The two are shown to be qualitatively
different asymptotically, but nevertheless to be instances of a common class
of reestimation-formula-asymptote pairs, in which they constitute the upper and
lower bounds of the convergence region of the cumulative of the frequency function,
as rank tends to infinity. The results demonstrate that Turing's formula is
qualitatively different from the various extensions to Zipf's law, and suggest that it
smooths the frequency estimates towards a geometric distribution.

1 Introduction 

Turing's formula [Good 1953] and Zipf's law [Zipf 1935] indicate how population frequencies in 
general tend to behave. Turing's formula estimates locally what the frequency count of a species 
that occurred x times in a sample really would have been, had the sample accurately reflected 
the underlying population distribution. Zipf's law prescribes the asymptotic behavior of the 
relative frequencies of species as a function of their rank. The ranking scheme in question 
orders the species by frequency, with the most common species ranked first. The reason 
that these formulas are of interest in computational linguistics is that they can be used to 
improve probability estimates from relative frequencies, and to predict the frequencies of unseen 
phenomena, e.g., the frequency of previously unseen words encountered in running text. 

Due to limitations in the amount of available training data, the so-called sparse-data
problem, estimating probabilities directly from observed relative frequencies may not always be
very accurate. For this reason, Turing's formula, in the incarnation of Katz's back-off scheme
[Katz 1987], has become a standard technique for improving parameter estimates for
probabilistic language models used by speech recognizers. A more theoretical treatment of Turing's
formula itself can be found in [Nadas 1985].

Zipf's law is commonly regarded as an empirically accurate description of a wide variety
of (linguistic) phenomena, but too general to be of any direct use. For a bit of historical
controversy on Zipf's law, we refer to [Simon 1955], [Mandelbrot 1959], and subsequent articles in
Information and Control. The model presented there for the stochastic source generating the
various Zipfian distributions is however linguistically highly dubious: a version of the
monkey-with-typewriter scenario.

The remainder of this article is organized as follows. In Section 2, we induce a recurrence
equation from Turing's local reestimation formula and from this derive the asymptotic
behavior of the relative frequency as a function of rank, using a continuum approximation. The
resulting probability distribution is then examined, and we rederive the recurrence equation
from it. In Section 3, we start with the asymptotic behavior stipulated by Zipf's law, and
derive a recurrence equation similar to that associated with Turing's formula, and from this
induce a corresponding reestimation formula. We then rederive the Zipfian asymptote from
the established recurrence equation. In Section 4, similar techniques are used to establish the
asymptotic behavior inherent in a general class of recurrence equations, parameterized by a
real-valued parameter, and then to rederive the recurrence equations from their asymptotes.
The convergence region of this parameter for the cumulative of the frequency function, as rank
approaches infinity, is also investigated. In Section 5, we summarize the results, discuss how
they might be used practically, and compare them with related work.



2 An Asymptote for Turing's Formula 

Turing's formula reestimates population frequencies locally:

$$x^* = (x + 1) \frac{N_{x+1}}{N_x} \tag{1}$$

Here $N_x$ is the number of species with frequency count $x$, and $x^*$ is the improved estimate of
$x$. Let $N$ be the size of the entire population and note that

$$N = \sum_{x=1}^{X} x \cdot N_x \qquad \text{and} \qquad f_x = \frac{x}{N}$$

where $X$ is the count of the most populous species and $f_x$ is the relative frequency of any
species with frequency count $x$.
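
As a minimal illustration of Turing's formula, the following Python sketch computes the reestimated counts from a collection of observed frequency counts. The function name `turing_reestimate` and the toy sample are assumptions made for this example only; the raw formula is applied as stated, without any smoothing of the $N_x$.

```python
from collections import Counter


def turing_reestimate(counts):
    """Map each observed count x to x* = (x + 1) * N_{x+1} / N_x."""
    # count_of_counts[x] is N_x, the number of species observed exactly x times.
    count_of_counts = Counter(counts)
    return {
        x: (x + 1) * count_of_counts.get(x + 1, 0) / n_x
        for x, n_x in sorted(count_of_counts.items())
    }


# Hypothetical toy sample: species label -> observed frequency count.
sample = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
estimates = turing_reestimate(sample.values())
print(estimates)
```

In this toy sample the largest observed count is mapped to zero, since no species has a count one higher; the raw formula is therefore only informative where $N_{x+1}$ is non-zero.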

Let $r(x)$ be the rank of the last species with frequency count $x$, where the most frequent
species is ranked first. This means that quite generally

$$r(x) = \sum_{k=x}^{X} N_k$$

and thus

$$N_x = \sum_{k=x}^{X} N_k - \sum_{k=x+1}^{X} N_k = r(x) - r(x + 1) \tag{2}$$

2.1 A continuum approximation

We first make a continuum approximation by extending $N_x$ from the integer points
$0, 1, 2, \ldots, X$ to a continuous function $N(x)$ on $[0, \infty)$. This means that

$$r(x) = \int_x^{\infty} N(y)\, dy$$

Differentiating this w.r.t. $x$, the lower bound of the integral, yields

$$\frac{dr(x)}{dx} = \frac{d}{dx} \int_x^{\infty} N(y)\, dy = -N(x)$$

and using the chain rule for differentiation, with $x = f \cdot N$, yields

$$\frac{dr}{df} = \frac{dr}{dx} \cdot \frac{dx}{df} = -N(x) \cdot N \tag{3}$$

Continuum approximations are useful techniques for establishing the dependence of a sum
on its bounds, to the leading term, and for determining convergence. For example, if we wish to
study the sum $\sum_{k=1}^{n} k^2$, we note that the corresponding integral is
$\int_1^n x^2\, dx = \frac{n^3 - 1}{3}$ and conclude that the sum behaves like $n^3$. The exact
formula is $\frac{n(n+1)(2n+1)}{6} = \frac{n^3}{3} + \frac{n^2}{2} + \frac{n}{6}$, so we in fact
even got the leading coefficient right. Likewise, we can establish for what values of $a$ the sum
$\sum_{k=1}^{\infty} k^a$ converges by explicitly calculating

$$\int_1^{\infty} x^a\, dx = \begin{cases} \left[\ln x\right]_1^{\infty} & \text{for } a = -1 \\ \left[\dfrac{x^{a+1}}{a+1}\right]_1^{\infty} & \text{for } a \neq -1 \end{cases}$$

indicating that the integral, and thus the sum, converge for $a < -1$ and diverge for $a \geq -1$.
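
The two uses of the continuum approximation described here can be checked numerically. The Python sketch below compares the sum of squares with its integral, and contrasts the partial sums of $k^a$ for $a = -2$ and $a = -1$. The helper names and the cut-off values are arbitrary choices for illustration.

```python
def sum_of_powers(a, n):
    """Partial sum of k**a for k = 1..n."""
    return sum(k ** a for k in range(1, n + 1))


def integral_of_squares(n):
    """Closed form of the integral of x**2 from 1 to n."""
    return (n ** 3 - 1) / 3


n = 1000
# Ratio of the exact sum to the integral; it approaches 1 as n grows.
ratio = sum_of_powers(2, n) / integral_of_squares(n)

# Contribution of the terms between 10^4 and 10^5: it is negligible for
# a = -2 (convergent), but about ln(10) for a = -1 (divergent).
tail_a_minus_2 = sum_of_powers(-2, 100_000) - sum_of_powers(-2, 10_000)
tail_a_minus_1 = sum_of_powers(-1, 100_000) - sum_of_powers(-1, 10_000)
print(ratio, tail_a_minus_2, tail_a_minus_1)
```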

We have to be a bit careful with the transition to the continuous case. We will first let $N$
become large and then establish what happens for small, but non-zero, values of $f = \frac{x}{N}$. So
although $x$ will be small compared to $N$, it will be large compared to any constant $C$. This
means that

$$f = \lim_{N \to \infty} \frac{x}{N} = \lim_{N \to \infty} \frac{x + C}{N}$$

for any additive constant $C$, and we may approximate $x + C$ with $x$, motivating
$\frac{1}{x+1} \approx \frac{1}{x}$ and similar approximations in the following.

2.2 The asymptotic distribution

For an ideal Turing population, we would have $x = x^*$. This gives us the recurrence equation

$$N_{x+1} = \frac{x}{x+1} N_x \tag{4}$$

implying that there are equally many inhabitants for frequency count $x$ as for frequency count
$x + 1$. This introduces several additional constraints, namely

$$x \cdot N_x = 1 \cdot N_1 \quad \text{and thus} \quad N_x = \frac{N_1}{x} \tag{5}$$

$$N = \sum_{x=1}^{X} x \cdot N_x = X \cdot N_1 \quad \text{and thus} \quad f_X = \frac{X}{N} = \frac{1}{N_1}$$

We are now prepared to derive the asymptotic behavior of the relative frequency $f(r)$ of
species as a function of their rank $r$ implicit in Eq. (4). Combining Eq. (5) with Eq. (3) yields

$$\frac{dr}{df} = -N(x) \cdot N = -\frac{N_1}{x} \cdot N = -\frac{N_1}{f}$$

This determines the rank $r(f)$ as a function of the relative frequency $f$:

$$r(f) = C - N_1 \ln f \tag{6}$$

Inverting this gives us the sought-for function $f(r)$:

$$f(r) = e^{\frac{C - r}{N_1}} = C' e^{-\frac{r}{N_1}}$$

Utilizing the fact that the relative frequencies should be normalized to one, we find that

$$1 = \int_1^{\infty} f(r)\, dr = C' \cdot N_1 e^{-\frac{1}{N_1}}$$

and that thus "Turing's asymptotic law" is

$$f(r) = \frac{1}{N_1} e^{-\frac{r-1}{N_1}} \tag{7}$$

Note that, reassuringly, the relative frequency of the most populous species, $f_X$, is preserved:

$$f(1) = \frac{1}{N_1} = \frac{X}{N} = f_X$$

Upon examining the frequency function $\frac{1}{N_1} e^{-\frac{r-1}{N_1}}$, we realize that we have an exponential
distribution with intensity parameter $\frac{1}{N_1}$, the probability of the most common species. This
distribution was created by approximating our original discrete distribution with a continuous
one. The discrete counterpart of an exponential distribution is a geometric distribution

$$P(r) = p \cdot (1 - p)^{r-1} \qquad r = 1, 2, \ldots$$

parameterized by $p$, the probability of some outcome occurring in one trial. $P(r)$ can then
be interpreted as the probability of waiting $r$ trials for the first occurrence of the outcome.
Thus, Turing's formula seems to be smoothing the frequency estimates towards a geometric
distribution.
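
The exponential decline can be illustrated numerically. The Python sketch below builds an ideal Turing population with $N_x = N_1 / x$, confirms that it is a fixed point of Turing's formula, and estimates the slope of $\ln f$ against rank, which the asymptotic law predicts to be close to $-1/N_1$. The values of $N_1$ and $X$, the real-valued treatment of $N_x$, and the helper names are assumptions made for illustration.

```python
import math


def ideal_turing_counts(n1, max_count):
    """N_x = N_1 / x for x = 1..X, treated as real-valued (continuum view)."""
    return {x: n1 / x for x in range(1, max_count + 1)}


def rank_of_last(counts_of_counts, x):
    """r(x): the number of species with frequency count >= x."""
    return sum(n for k, n in counts_of_counts.items() if k >= x)


n1, max_count = 1000.0, 2000
n_x = ideal_turing_counts(n1, max_count)

# Fixed point of Turing's formula: x* = (x + 1) * N_{x+1} / N_x equals x.
x = 7
x_star = (x + 1) * n_x[x + 1] / n_x[x]

# Slope of ln f against rank between two counts; since f = x / N,
# the difference in ln f equals the difference in ln x.
lo, hi = 500, 1000
slope = (math.log(hi) - math.log(lo)) / (rank_of_last(n_x, hi) - rank_of_last(n_x, lo))
print(x_star, slope * n1)
```

The product `slope * n1` should come out close to $-1$, with the small deviation due to replacing the sum over $N_k$ by an integral.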



2.3 Rederiving Turing's formula 

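To test our derivation of the asymptotic equation (7) from the recurrence equation (4), we will
attempt to rederive Eq. (4) from Eq. (7). Since Eq. (7) implies Eq. (6), we start from the latter
and, with $f = \frac{x}{N}$, establish that

$$r(x) = C - N_1 \ln \frac{x}{N}$$

Inserting this into Eq. (2) yields

$$N_x = r(x) - r(x + 1) = -N_1 \ln \frac{x}{N} + N_1 \ln \frac{x+1}{N} = N_1 \ln \frac{x+1}{x} = N_1 \ln\left(1 + \frac{1}{x}\right)$$

This means that

$$\frac{N_{x+1}}{N_x} - \frac{x}{x+1} = \frac{\ln\left(1 + \frac{1}{x+1}\right)}{\ln\left(1 + \frac{1}{x}\right)} - \frac{x}{x+1} = \frac{(x+1) \ln\left(1 + \frac{1}{x+1}\right) - x \ln\left(1 + \frac{1}{x}\right)}{(x+1) \ln\left(1 + \frac{1}{x}\right)}$$

We first note that $\frac{1}{y+1} < \ln\left(1 + \frac{1}{y}\right) < \frac{1}{y}$ for $y > 0$, so that the
denominator satisfies $(x+1) \ln\left(1 + \frac{1}{x}\right) > 1$.

We also note that the numerator can be written as $g(x + 1) - g(x)$ for $g(y) = y \ln\left(1 + \frac{1}{y}\right)$,
which in turn can be written as $\int_x^{x+1} \frac{d}{dy} g(y)\, dy$, i.e., as
$\int_x^{x+1} \left( \ln\left(1 + \frac{1}{y}\right) - \frac{1}{y+1} \right) dy$. We
further note that if $A < h(y) < B$ on $(a, b)$, then $A(b - a) < \int_a^b h(y)\, dy < B(b - a)$. Hence

$$0 < g(x+1) - g(x) < \frac{1}{x} - \frac{1}{x+1} = \frac{1}{x(x+1)}$$

We have thus proved that

$$0 < \frac{N_{x+1}}{N_x} - \frac{x}{x+1} < \frac{1}{x^2}$$

and since we assume that $x \gg 1$, this
reestablishes Eq. (4) (to the second power of $\frac{1}{x}$).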
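
The size of the deviation can be checked numerically. Assuming $N_x = N_1 \ln(1 + 1/x)$, as obtained by inserting the logarithmic rank function into Eq. (2), the Python sketch below evaluates the difference between $N_{x+1}/N_x$ and $x/(x+1)$ for a few values of $x$. The function names are hypothetical, and the bound $1/x^2$ used in the comparison is our reading of the result rather than a quotation of it.

```python
import math


def ratio_from_log_rank(x):
    """N_{x+1} / N_x when N_x = N_1 * ln(1 + 1/x); N_1 cancels."""
    return math.log1p(1 / (x + 1)) / math.log1p(1 / x)


def gap(x):
    """Deviation of the ratio from the Turing recurrence value x / (x + 1)."""
    return ratio_from_log_rank(x) - x / (x + 1)


gaps = {x: gap(x) for x in (1, 10, 100, 1000)}
for x, g in gaps.items():
    # The deviation is positive and shrinks on the order of 1 / x**2.
    print(x, g, 1 / x ** 2)
```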

3 A Reestimation Formula for Zipf's Law

Zipf's law concerns the asymptotic behavior of the relative frequencies $f(r)$ of a population as a
function of rank $r$. It states that, asymptotically, the relative frequency is inversely proportional
to rank:

$$f(r) = \frac{A}{B + r} \tag{8}$$

This implies a finite total population, since the cumulative (i.e., the sum or integral) of the
relative frequency over rank does not converge as rank approaches infinity:

$$F(r) = \begin{cases} \sum_{k=1}^{r} f(k) & \text{in the discrete case} \\ \int_1^{r} f(\rho)\, d\rho & \text{in the continuous case} \end{cases}$$

$$\lim_{r \to \infty} F(r) = \lim_{r \to \infty} A \ln(B + r) = \infty$$

To localize Zipf's law, we utilize Eq. (2) and observe that $r(x) = \frac{A}{f_x} - B = \frac{A'}{x} - B$,
with $A' = A \cdot N$, so that

$$\frac{N_{x+1}}{N_x} = \frac{r(x+1) - r(x+2)}{r(x) - r(x+1)} = \frac{\frac{A'}{x+1} - \frac{A'}{x+2}}{\frac{A'}{x} - \frac{A'}{x+1}} = \frac{\frac{1}{(x+1)(x+2)}}{\frac{1}{x(x+1)}} = \frac{x}{x+2} \tag{9}$$

This suggests "Zipf's local reestimation formula"

$$x^* = (x + 2) \frac{N_{x+1}}{N_x} \tag{10}$$

which is deceptively similar to Turing's formula, Eq. (1), the only difference being that it
assigns a factor $\frac{x+2}{x+1}$ more relative-frequency mass to frequency count $x$.
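
The localization step can be verified with exact rational arithmetic. The Python sketch below takes the Zipfian rank function $r(x) = A'/x - B$, forms $N_x = r(x) - r(x+1)$, and checks the ratio of successive counts. The parameter values and function names are arbitrary choices for illustration.

```python
from fractions import Fraction


def zipf_rank(x, a_prime, b):
    """r(x) = A'/x - B: rank of the last species with count x under Zipf's law."""
    return Fraction(a_prime, x) - b


def zipf_ratio(x, a_prime=1000, b=3):
    """N_{x+1} / N_x with N_x = r(x) - r(x + 1)."""
    n_x = zipf_rank(x, a_prime, b) - zipf_rank(x + 1, a_prime, b)
    n_x1 = zipf_rank(x + 1, a_prime, b) - zipf_rank(x + 2, a_prime, b)
    return n_x1 / n_x


# The ratio equals x / (x + 2) regardless of A' and B, so the reestimate
# (x + 2) * N_{x+1} / N_x leaves an ideal Zipfian population unchanged.
x = 5
x_star = (x + 2) * zipf_ratio(x)
print(zipf_ratio(x), x_star)
```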

3.1 Rederiving Zipf's law 

If we rederive the asymptotic behavior, we again obtain Zipf's law. Assuming the recurrence
equation

$$N_{x+1} = \frac{x}{x+2} N_x$$

we have that

$$N_{x+1} = \frac{x}{x+2} N_x = \frac{x! \cdot 2!}{(x+2)!} N_1 = \frac{2}{(x+1)(x+2)} \cdot N_1 \approx \frac{C}{x^2} \tag{11}$$

We again use the equation for the derivative of the rank, Eq. (3), but now

$$\frac{dr}{df} = -N(x) \cdot N = -\frac{C}{x^2} \cdot N = -\frac{C'}{f^2}$$

Integration yields $r(f) = \frac{C'}{f} + C''$ and function inversion

$$f(r) = \frac{C'}{r - C''}$$

Identifying $C'$ with $A$ and $C''$ with $-B$ recovers Eq. (8).
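
The closed form obtained by unrolling the Zipfian recurrence can be confirmed by direct iteration. The Python sketch below iterates $N_{x+1} = \frac{x}{x+2} N_x$ with exact fractions and compares the result with $\frac{2}{(x+1)(x+2)} N_1$. The starting value $N_1 = 6$ and the function name are arbitrary assumptions.

```python
from fractions import Fraction


def zipf_counts(n1, max_count):
    """Iterate N_{x+1} = x / (x + 2) * N_x starting from N_1."""
    counts = {1: Fraction(n1)}
    for x in range(1, max_count):
        counts[x + 1] = Fraction(x, x + 2) * counts[x]
    return counts


counts = zipf_counts(n1=6, max_count=200)

# x**2 * N_x / N_1 approaches the constant 2, i.e. N_x decays like C / x**2.
scaled = float(counts[200]) * 200 ** 2 / 6
print(counts[2], counts[3], scaled)
```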

4 A General Correspondence 

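If we generalize the rederivation of Zipf's law in Eq. (11) to $\theta = 2, 3, \ldots$ with $\theta < x$, we find that

$$N_{x+1} = \frac{x}{x+\theta} N_x = \frac{x! \cdot \theta!}{(x+\theta)!} N_1 \approx \frac{C}{x^{\theta}}$$

and, by Eq. (3), that $\frac{dr}{df} = -\frac{C'}{f^{\theta}}$. We integrate $\frac{dr}{df}$ to get
$r(f) = \frac{C'}{(\theta - 1) f^{\theta-1}} + C''$, yielding an $r^{-\frac{1}{\theta-1}}$ asymptote for $f(r)$.

Although a nontrivial generalization, it is in fact the case that for real-valued $\theta$: $1 < \theta < x$,

$$N_{x+1} = \frac{x}{x+\theta} N_x \tag{12}$$

results in the asymptote$^1$

$$f(r) = C r^{-\frac{1}{\theta-1}} \tag{13}$$

The key observation here is that also for real-valued $\theta < x$ in general,

$$N_{x+1} = \frac{x!}{\prod_{k=1}^{x} (k + \theta)} N_1 \approx \frac{C}{x^{\theta}}$$

This means that we have a single reestimation equation

$$x^* = (x + \theta) \cdot \frac{N_{x+1}}{N_x} \tag{14}$$

parameterized by the real-valued parameter $\theta$, with the asymptotic behavior

$$f(r) = \begin{cases} C r^{-\frac{1}{\theta-1}} & \theta \neq 1 \\ C e^{-\lambda r} & \theta = 1 \end{cases} \tag{15}$$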


Although this correspondence was derived with the requirement that $\theta < x$, we can in view
of the discussion in Section 2.1 assume that $x$ is not only considerably larger than 1, but also
greater than any fixed value of $\theta$. The extension to the negative real numbers is
straightforward, although perhaps not very sensible. In fact, the convergence region for the cumulative
of the frequency function as rank goes to infinity,

$$\sum_{r=1}^{\infty} f(r) \quad \text{or} \quad \int_1^{\infty} f(r)\, dr$$

is $\theta \in [1, 2)$, establishing Turing's formula and Zipf's law as the two extremes of this
reestimation formula, in terms of resulting in a proper probability distribution for infinite populations;
while the former does so, the latter does not.
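
The general correspondence can be explored numerically. The Python sketch below evaluates the unrolled recurrence $N_{x+1}/N_1 = \prod_{k=1}^{x} \frac{k}{k+\theta}$ for real-valued $\theta$ through the log-gamma function, and encodes the convergence criterion for the resulting asymptote when $\theta \geq 1$. The function names are hypothetical, and the use of the gamma function to express the product is our choice of implementation, not a step taken in the text.

```python
import math


def count_ratio(x, theta):
    """N_{x+1} / N_1 = prod_{k=1..x} k / (k + theta), computed via log-gamma."""
    return math.exp(
        math.lgamma(x + 1) + math.lgamma(theta + 1) - math.lgamma(x + theta + 1)
    )


def tail_exponent(theta):
    """Exponent alpha of the asymptote f(r) ~ C * r**(-alpha), for theta != 1."""
    return 1.0 / (theta - 1.0)


def cumulative_converges(theta):
    """Whether the cumulative of f(r) converges as rank grows (theta >= 1)."""
    if theta == 1:
        return True  # exponentially declining asymptote (Turing case)
    return tail_exponent(theta) > 1  # power-law tail is summable iff alpha > 1


x, theta = 10 ** 4, 1.5
# N_{x+1} * x**theta / N_1 approaches a constant, here gamma(theta + 1).
scaled = count_ratio(x, theta) * x ** theta
print(scaled, math.gamma(theta + 1))
print([cumulative_converges(t) for t in (1, 1.5, 2, 3)])
```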



$^1$ If $\theta = 1$, we have the Turing case with an exponentially declining asymptote, cf. Eq. (7).



This recaptures Eq. (12). Note that the derivation of Zipf's recurrence equation in Eq. (9) of
Section 3 corresponds to the special case where the exponent of the asymptote, $\frac{1}{\theta - 1}$, equals 1, i.e., where $\theta = 2$.

5 Conclusions 

The relationship between Turing's formula and Zipf's law, which both concern population
frequencies, was explored in the present article. The asymptotic behavior of the relative frequency
as a function of rank implicit in one interpretation of Turing's local reestimation formula was
derived and compared with Zipf's law. While the latter relates the rank and relative frequency
as asymptotically inversely proportional, the former states that the frequency declines
exponentially with rank. This means that while Zipf's law implies a finite total population, Turing's
formula yields a proper probability distribution also for infinite populations.

In fact, it is tempting to interpret Turing's formula as smoothing the relative-frequency
estimates towards a geometric distribution. This could potentially be used to improve
sparse-data estimates by assuming a geometric distribution (tail), and introducing a ranking based
on direct frequency counts, frequency counts when backing off to more general conditionings,
order of appearance in the training data, or, to break any remaining ties, lexicographical order.

Conversely, a local reestimation formula in the vein of Turing's formula was derived from
Zipf's law. Although the two equations are similar, Turing's formula shifts the frequency mass
towards more frequent species. The two cases were generalized to a single spectrum of
reestimation formulas and corresponding asymptotes, parameterized by one real-valued parameter.
Furthermore, the two cases correspond to the upper and lower bounds of this parameter for
which the cumulative of the frequency function converges as rank tends to infinity.

These results are in sharp contrast to common belief in the field; in [Baayen 1991], for 
example, we read: "Other models, such as Good (1953) . . . have been put forward, all of 
which have Zipf's law as some special or limiting form." All of the Zipf-Simon-Mandelbrot 
distributions exhibit the same basic asymptotic behavior, 

$$f(r) = \frac{C}{r^{\beta}}$$

parameterized by the positive real-valued parameter $\beta$. Comparing this with Eq. (15), we find
that $\beta = \frac{1}{\theta - 1} > 0$ and thus $\theta = \frac{\beta + 1}{\beta} > 1$. In view of the established exponentially declining
asymptote of the ideal Turing distribution, corresponding to $\theta = 1$, we can conclude that the
latter is qualitatively different.